
48 research outputs found

A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters.

Author: Jaf Sardar
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 13/03/2017

In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. In data harvested from the Internet, many characters may have exactly the same form but be encoded with different Unicode values (ambiguous characters). Ambiguous characters in words lead to word duplication, so it is important to identify and unify them during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
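The word-duplication problem described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the mapping table, the choice of canonical code points, and the function names `unify` and `find_ambiguous` are assumptions made for illustration, and a real table would need review by someone who knows the target language.

```python
import unicodedata

# Hypothetical mapping for illustration only: each key is a code point that
# can look like its value in Arabic-script text but is encoded differently.
AMBIGUOUS_TO_CANONICAL = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}

_TABLE = str.maketrans(AMBIGUOUS_TO_CANONICAL)


def unify(text: str) -> str:
    """Replace every mapped character with its assumed canonical form."""
    return text.translate(_TABLE)


def find_ambiguous(text: str) -> list[tuple[str, str]]:
    """List (character, Unicode name) pairs in text that the table would rewrite.

    A listing like this is what a human reviewer could inspect before
    accepting a mapping, which is the manual half of a semi-automatic workflow.
    """
    found = {ch for ch in text if ch in AMBIGUOUS_TO_CANONICAL}
    return sorted((ch, unicodedata.name(ch)) for ch in found)


# Two encodings of a similar-looking word: the first letter differs only
# in its code point (U+0643 versus U+06A9).
word_a = "\u0643\u062A\u0627\u0628"
word_b = "\u06A9\u062A\u0627\u0628"

vocabulary_before = {word_a, word_b}
vocabulary_after = {unify(word_a), unify(word_b)}
```

Before unification the two strings count as distinct vocabulary entries; after unification they collapse into one, which is the duplication effect the abstract refers to.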

Durham Research Online

Crossref

A Simple Approach to Unify Ambiguously Encoded Kurdish Characters

Author: Jaf Sardar
Publication date: 09/09/2016

In this study, we outline a potential problem in the normalisation stage of processing texts that are based on a modified version of the Arabic alphabet. The main resource available for processing resource-scarce languages is raw text. We have identified an interesting challenge that must be addressed when normalising certain natural language texts. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. In data harvested from the Internet, many characters may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify ambiguous characters during the normalisation stage of most text processing tasks. We demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.

Durham Research Online

THE APPLICATION OF CONSTRAINT RULES TO DATA-DRIVEN PARSING

Author: Jaf Sardar
Publication date: 31/12/2015

The University of Manchester - Institutional Repository

Towards the Development of a Hybrid Parser for Natural Languages

Authors: Jaf Sardar; Ramsay Allan
Publication venue: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
Publication date: 01/01/2013

In order to understand natural languages, we have to be able to determine the relations between words; in other words, we have to be able to 'parse' the input text. This is a difficult task, especially for Arabic, which has a number of properties that make it particularly difficult to handle. There are two approaches to parsing natural languages: grammar-driven and data-driven. Each of these approaches poses its own set of problems, which we discuss in this paper. The goal of our work is to produce a hybrid parser, which retains the advantages of the data-driven approach but is guided by grammar rules in order to produce more accurate output. This work consists of two stages: the first stage is to develop a baseline data-driven parser, which is guided by a machine learning algorithm for establishing dependency relations between words. The second stage is to integrate grammar rules into the baseline parser. In this paper, we describe the first stage of our work, which is now implemented, and a number of experiments that have been conducted on this parser. We also discuss the results of these experiments and highlight the different factors that affect parsing speed and the correctness of the parser's output.

Dagstuhl Research Online Publication Server

Sunderland University Institutional Repository

Deterministic choices in a data-driven parser.

Authors: Jaf Sardar; Ramsay Allan
Publication venue: Libreria Editrice Cafoscarina
Publication date: 22/09/2015

Data-driven parsers rely on recommendations from parse models, which are generated from a set of training data using a machine learning classifier, to perform parse operations. However, in some cases a parse model cannot recommend a parse action to a parser unless it learns from the training data what parse action(s) to take in every possible situation. Therefore, it is hard for a parser to make an informed decision about which parse operation to perform when a parse model recommends either no parse action or several. Here we examine the effect of various deterministic choices on a data-driven parser when it is presented with no recommendation or several recommendations from a parse model.

Durham Research Online

The University of Manchester - Institutional Repository

On the Development of Large Scale Corpus for Native Language Identification

Authors: Hudson Thomas; Jaf Sardar
Publication venue: Linkoping University Electronic Press, Sweden
Publication date: 13/12/2018

Sunderland University Institutional Repository

Deterministic Decisions in Non-deterministic Parsing

Authors: Jaf Sardar; Ramsay Allan
Publication venue: Libreria Editrice Cafoscarina
Publication date: 24/09/2015

Sunderland University Institutional Repository

The application of constraint rules to data-driven parsing.

Authors: Jaf Sardar; Ramsay Allan
Publication venue: Incoma Ltd. Shoumen, Bulgaria
Publication date: 07/09/2015

In this paper, we show an approach to extracting different types of constraint rules from a dependency treebank. We also show an approach to integrating these constraint rules into a data-driven dependency parser, where the constraint rules inform parsing decisions in specific situations in which a set of parsing rules (induced from a classifier) may make several recommendations to the parser. Our experiments have shown that parsing accuracy can be improved by using different sets of constraint rules in combination with a set of parsing rules. Our parser is based on the arc-standard algorithm of MaltParser but with a number of extensions, which we discuss in some detail.
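For readers unfamiliar with the arc-standard algorithm named in this abstract, the following Python sketch shows a common textbook variant that builds arcs between the top two items on the stack. It is only an illustration: MaltParser's own implementation and the authors' extensions may differ, and the names `arc_standard`, `scripted` and `choose_action` are hypothetical. The scripted action sequence stands in for the classifier-induced parsing rules.

```python
def arc_standard(words, choose_action):
    """Run a textbook arc-standard parse and return {dependent: head} indices.

    choose_action(stack, buffer) must return "SHIFT", "LEFT-ARC" or
    "RIGHT-ARC"; in a data-driven parser this is where a trained model
    (and, in a hybrid parser, any constraint rules) would be consulted.
    """
    stack, buffer, heads = [], list(range(len(words))), {}
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer)
        if action == "SHIFT" and buffer:
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) >= 2:
            # The second item on the stack becomes a dependent of the top item.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            # The top item becomes a dependent of the item beneath it.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"action {action!r} is not applicable here")
    return heads


def scripted(actions):
    """Return a choose_action function that replays a fixed action list."""
    remaining = iter(actions)
    return lambda stack, buffer: next(remaining)


words = ["She", "reads", "books"]
heads = arc_standard(
    words,
    scripted(["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]),
)
```

In this run, "She" and "books" both attach to "reads", which is left on the stack as the root. The situation the abstract describes, where several actions are recommended at once, would arise inside `choose_action`.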

Durham Research Online

Sunderland University Institutional Repository

The University of Manchester - Institutional Repository